In [1]:
%pylab inline
import matplotlib.pyplot as plt
import networkx as nx
import nltk
import pandas as pd
import re
from sklearn.feature_extraction.text import CountVectorizer
import urllib2
pd.options.display.mpl_style = 'default'
This notebook is going to explore a method of text featurization that I will call nframe.
Many basic natural language processing techniques depend on a "bag of words" document model. This model strips away all syntactic information from a document and looks only at term frequency. A term can be a single word or several adjacent words; this class of features is referred to collectively as n-grams. Bag-of-n-grams models are widely used for applications like sentiment analysis, classification, and topic modelling. They are a useful representation for these applications because they represent documents as vectors and corpora as matrices, data structures that computers are fast at processing.
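To make that concrete, here is a minimal sketch of the bag-of-n-grams representation on a made-up two-document corpus (the example text and the variable names toy_corpus, toy_vectorizer, and toy_dtm are invented for illustration):
In [ ]:
# A made-up corpus, just to illustrate the bag-of-n-grams representation.
toy_corpus = ["the turtle sat on the stone",
              "the turtle king sat on the turtles"]
toy_vectorizer = CountVectorizer(ngram_range=(1, 2))  # unigrams and bigrams
toy_dtm = toy_vectorizer.fit_transform(toy_corpus)
print toy_vectorizer.get_feature_names()   # the n-gram "terms"
print toy_dtm.toarray()                    # one row per document, one column per term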
I am interested in using NLP to compare the meanings of different documents. So I have been working on a "new" technique. Likely, it is not really new and I have just not done the requisite amount of literature review. Reconciling this work with existing literature on e.g. skip-gram featurization remains to be done.
To start, let's get a book with several chapters. For simplicity, I have picked Yertle the Turtle, by Dr. Seuss, treating each stanza as a separate chapter. I encourage you to try this on your own favorite freely available book. Longer books will take more time to process and the results will be harder to interpret.
In [2]:
book_url = "http://www.spunk.org/texts/prose/sp000212.txt"
book = urllib2.urlopen(book_url).read()
print book[:1250]
We are going to nframe each chapter of the book. Look at the text of the book and see if you can split it into chapters using basic Python tools.
In [3]:
# This is a regular expression for the chapter divides in the book
chapter_divide_re = "\n\n"
# We need to clear out the source text's hard line breaks before further processing
to_remove = "\n"
def chapters(book):
    chapters = re.split(chapter_divide_re, book)
    return [re.sub(to_remove, " ", c) for c in chapters]

for i,c in enumerate(chapters(book)):
    print "Chapter " + str(i)
    print c[:200]
    print
When we nframe a chapter, we split it into many sub-documents in a principled way. For example, we can look at the many individual sentences in a chapter. This preserves some of the syntactic information for the chapter without requiring a potentially computationally or conceptually burdensome syntax parser. There's a lot of latitude in how you separate a chapter into sentences. I'm going to do it in a simple way.
In [4]:
# The regular expression that we will split sentences on.
# This could get more complicated if you're more sensitive
# to quotation marks, for example.
sentence_divide_re = "[\.\!\?]"
def sentences(chapters):
    return [re.split(sentence_divide_re, c) for c in chapters]

for i,ss in enumerate(sentences(chapters(book))):
    print "Chapter " + str(i) + ", " + str(len(ss)) + " sentences"
    for s in ss[:3]:  # show the first three sentences of each chapter
        print s
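As an aside, nltk (imported at the top but not otherwise used here) ships a trained sentence tokenizer that handles abbreviations and quotation marks more carefully than the regex above. Here is a sketch of swapping it in, assuming the punkt model has already been fetched with nltk.download('punkt'):
In [ ]:
# An alternative sentence splitter using nltk's punkt tokenizer.
# Assumes nltk.download('punkt') has been run once on this machine.
def nltk_sentences(chapters):
    return [nltk.sent_tokenize(c) for c in chapters]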
Next, for each chapter, we are going to create a document-term matrix whose rows are the chapter's sentences. This is just the old bag of words model everybody is using, and there are lots of open software packages that do this work.
In [5]:
# see here for how to add lemmatization to the tokenizer
# http://scikit-learn.org/stable/modules/feature_extraction.html
vectorizer = CountVectorizer(
    # token_pattern=r'\b\w+\b',
    # min_df=2,
    # max_df=0.6,
    binary=True,
    stop_words='english',
)
# For now, each chapter will have its own feature-index mapping.
# This may make comparison difficult later on.
models = [(vectorizer.fit_transform(ss), vectorizer.get_feature_names())
          for ss in sentences(chapters(book))]
# Note that the model is a sparse matrix and each column of the matrix
# is associated with a single word term.
print models[4]
Here's the big idea. Now we're going to turn the document-term matrix of sentences into one term cooccurrence matrix for each chapter.
Word on the street is that cooccurrence matrices have been used by Google and others to represent word semantics and do amazing things. Or maybe cooccurrence is a red herring and the real magic is in deep learning inspired systems like word2vec. Since deep learning is complicated and hard to interpret, I'm using cooccurrence matrices (and the implied semantic networks) as a poor man's proxy for a document-specific vector representation of the meaning of each word.
Hey I just met you, and this is crazy; but here's my number so call me maybe.
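Before running this on the book, here is why the matrix product works. With binary=True, each row of the sentence-term matrix is a 0/1 indicator of which terms appear in that sentence, so entry (i, j) of dtm.T * dtm counts the sentences in which terms i and j occur together, and the diagonal counts the sentences each term occurs in. A made-up two-sentence, three-term sketch (toy is an invented example; np comes from the pylab import at the top):
In [ ]:
# A made-up binary sentence-term matrix: 2 sentences, 3 terms.
toy = np.array([[1, 1, 0],   # sentence 0 contains terms 0 and 1
                [1, 0, 1]])  # sentence 1 contains terms 0 and 2
# Entry (i, j) counts sentences containing both term i and term j;
# the diagonal counts sentences containing term i.
print toy.T.dot(toy)
# expected:
# [[2 1 1]
#  [1 1 0]
#  [1 0 1]]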
In [6]:
models[4][0].T * models[4][0]
Out[6]:
In [7]:
# compute a term cooccurrence matrix from a document-term matrix
def cooccurences(dtm):
    return dtm.T * dtm

nframes = [(cooccurences(dtm), f) for dtm, f in models]
Now we have what we need to nframe a corpus of chapters. Each nframing is a purely descriptive matrix of within-document word cooccurrences. It takes all the same parameters as the standard bag of words model (size of n-gram, stop word reduction, stemming, lemmatization), plus an additional parameter: the sentence splitting function. You can experiment with different parameters using this notebook.
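For instance, a bigram nframing could be built by swapping in a differently configured vectorizer. This is just a sketch; the names bigram_vectorizer, bigram_models, and bigram_nframes are made up here:
In [ ]:
# A sketch of re-running the pipeline with different parameters:
# unigrams plus bigrams, keeping the binary counts and English stop word list.
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2), binary=True, stop_words='english')
bigram_models = [(bigram_vectorizer.fit_transform(ss), bigram_vectorizer.get_feature_names())
                 for ss in sentences(chapters(book))]
bigram_nframes = [(cooccurences(dtm), f) for dtm, f in bigram_models]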
Let's visualize one of these nframings.
In [8]:
def nframe_pcolor(coc, feature_names):
    coc = coc.toarray()
    plt.pcolor(coc, cmap=matplotlib.cm.Greys)
    # These lines would show the term labels on the axes, but that gets messy
    #plt.yticks(np.arange(0.5, len(feature_names), 1), feature_names)
    #plt.xticks(np.arange(0.5, len(feature_names), 1), feature_names)
    plt.axis([0, len(feature_names), 0, len(feature_names)])
In [9]:
plt.figure(1,figsize=(12, 12))
nframe_pcolor(*nframes[3])
plt.show()
An nframing can also be read as an adjacency matrix between terms. This is a lot like the semantic network, a model for representing meaning that has been used in cognitive psychology for many years. A semantic network represents how strongly different words are associated with each other, as a network with weighted edges. One way to think about nframe is that it creates a semantic network specific to each chapter, or document, in the corpus.
Let's visualize one of the nframings as a semantic network.
In [10]:
def nframe_to_semantic_network(coc, feature_names):
    coc = coc.toarray()
    G = nx.from_numpy_matrix(coc)
    # the diagonal of the cooccurrence matrix counts the sentences each term occurs in
    occurence = dict([(i, float(coc[i,i])) for i in range(coc.shape[0])])
    nx.set_node_attributes(G, 'occurence', occurence)
    G = nx.relabel_nodes(G, dict(enumerate(feature_names)), copy=True)
    return G
In [11]:
G = nframe_to_semantic_network(*nframes[3])
In [12]:
def occ_to_size(occ):
    # scale occurrence counts to reasonable node sizes
    return log(occ + 1) * 300

def draw_semantic_network(G):
    pos = nx.graphviz_layout(G, prog='neato')
    # positive value nodes (green)
    plus_nodes = [u for u,d in G.nodes(data=True) if d['occurence'] >= 0]
    plus_sizes = [occ_to_size(G.node[u]['occurence']) for u in plus_nodes]
    nx.draw_networkx_nodes(G, pos, nodelist=plus_nodes, node_size=plus_sizes, node_color='#CCFFCC')
    # negative value nodes (red)
    neg_nodes = [u for u,d in G.nodes(data=True) if d['occurence'] < 0]
    neg_sizes = [occ_to_size(-G.node[u]['occurence']) for u in neg_nodes]
    nx.draw_networkx_nodes(G, pos, nodelist=neg_nodes, node_size=neg_sizes, node_color='#FFCCCC')
    # positive value edges
    plus_edges = [(u,v) for u,v,d in G.edges(data=True) if d['weight'] >= 0]
    plus_widths = [G[u][v]['weight'] for u, v in plus_edges]
    nx.draw_networkx_edges(G, pos, edgelist=plus_edges, width=plus_widths, edge_color='g', alpha=0.8)
    # negative value edges
    neg_edges = [(u,v) for u,v,d in G.edges(data=True) if d['weight'] < 0]
    neg_widths = [-G[u][v]['weight'] for u, v in neg_edges]
    nx.draw_networkx_edges(G, pos, edgelist=neg_edges, width=neg_widths, edge_color='r', alpha=0.8)
    nx.draw_networkx_labels(G, pos);
In [13]:
plt.figure(1,figsize=(10, 10))
draw_semantic_network(G)
Can we use nframing to compare the meaning of different documents? I hope so! Otherwise, this is a big waste of time.
Let's experiment with some ways of using nframings to look at the differences between two chapters. Can you tell from these visualizations what these two chapters are specifically about? Are there concepts or characters represented by the same word that are different in each chapter?
In [14]:
two_chapters = (3,13)
fig = plt.figure(120,figsize=(15, 8))
plt.subplot(121)
nframe_pcolor(*nframes[two_chapters[0]])
plt.subplot(122)
nframe_pcolor(*nframes[two_chapters[1]])
plt.show()
In [15]:
G2 = nframe_to_semantic_network(*nframes[13])
In [16]:
plt.figure(120,figsize=(15, 8))
plt.subplot(121)
draw_semantic_network(nframe_to_semantic_network(*nframes[two_chapters[0]]))
plt.subplot(122)
draw_semantic_network(nframe_to_semantic_network(*nframes[two_chapters[1]]))
plt.show()
To test whether these kinds of visualizations are useful, we are going to construct something called a semantic network diff.
You might be familiar with the idea of a 'diff' in the context of text or code. In that context, a 'diff' shows you which lines have changed between two versions of a document.
Adapting this idea to semantic networks, here we look at the differences between two networks: which nodes and edges one network has that the other does not, and how the occurrence counts and edge weights differ where the networks overlap.
In [17]:
def semantic_network_difference(g1, g2):
    g3 = g1.copy()
    for node, data in g2.nodes(data=True):
        if node in g1:
            g3.node[node]['occurence'] = g3.node[node]['occurence'] - data['occurence']
        else:
            g3.add_node(node, occurence=-data['occurence'])
    for u, v, data in g2.edges(data=True):
        if g1.has_edge(u, v):
            g3[u][v]['weight'] = g3[u][v]['weight'] - data['weight']
        else:
            g3.add_edge(u, v, weight=-data['weight'])
    return g3
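Note the sign convention: occurrence counts and edge weights that are larger in the first network come out positive (drawn green by draw_semantic_network), while material that is stronger in, or unique to, the second network comes out negative (drawn red). Here is a tiny made-up example; the graphs ga and gb are invented for this sketch:
In [ ]:
# Two tiny hand-built "semantic networks" to illustrate the sign convention.
ga = nx.Graph()
ga.add_node('turtle', occurence=3)
ga.add_node('king', occurence=1)
ga.add_edge('turtle', 'king', weight=2)
gb = nx.Graph()
gb.add_node('turtle', occurence=1)
gb.add_node('mud', occurence=2)
print semantic_network_difference(ga, gb).nodes(data=True)
# expected: turtle -> 2 (stronger in ga), king -> 1 (only in ga), mud -> -2 (only in gb)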
Now we can walk through the whole book and look at the diff for each pair of consecutive chapters.
In [18]:
def chapter_diff(chx, chy):
    gx = nframe_to_semantic_network(*nframes[chx])
    gy = nframe_to_semantic_network(*nframes[chy])
    gz = semantic_network_difference(gx, gy)
    return gz
In [19]:
for n in range(len(chapters(book)) - 1):
    plt.figure(120, figsize=(13, 5))
    print chapters(book)[n]
    gd = chapter_diff(n, n+1)
    draw_semantic_network(gd)
    plt.show()
    #print chapters(book)[n+1]
This notebook was created by Sebastian Benthall for the D-Lab at UC Berkeley.
In [ ]: